2024-01-30
Today's agenda:

- Using `ols` to fit a linear model
- `summary` and `Predict` with an `ols` fit
- `dfbeta` and influential observations
- Polynomial and spline terms in `ols` fits
Today's data set (`helpdat`, from the HELP study) comes from the Health Evaluation and Linkage to Primary Care trial, and is stored as `HELPrct` in the `mosaicData` package.
HELP was a clinical trial of adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care.
We will look at 453 subjects with complete data today.
| Variable | Description |
|---|---|
| id | subject identifier |
| cesd | Center for Epidemiologic Studies Depression measure (higher scores indicate more depressive symptoms) |
| age | subject age (in years) |
| sex | female (n = 107) or male (n = 346) |
| subst | primary substance of abuse (alcohol, cocaine or heroin) |
| mcs | SF-36 Mental Component Score (lower = worse status) |
| pcs | SF-36 Physical Component Score (lower = worse status) |
| pss_fr | perceived social support by friends (higher = more support) |
`helpdat` categorical variables

```r
helpdat |> tabyl(sex, subst) |>
  adorn_totals(where = c("row", "col")) |>
  adorn_percentages() |>
  adorn_pct_formatting() |>
  adorn_ns(position = "front") |>
  adorn_title(placement = "combined") |>
  kable(align = 'lrrrr')
```

| sex/subst | alcohol | cocaine | heroin | Total |
|---|---|---|---|---|
| female | 36 (33.6%) | 41 (38.3%) | 30 (28.0%) | 107 (100.0%) |
| male | 141 (40.8%) | 111 (32.1%) | 94 (27.2%) | 346 (100.0%) |
| Total | 177 (39.1%) | 152 (33.6%) | 124 (27.4%) | 453 (100.0%) |
`helpdat` quantitative variables

```
quantitative variables:
  name    class   min Q1 median Q3 max mean   sd   n missing
1 cesd   integer  1.0 25     34 41  60 32.8 12.5 453       0
2 age    integer 19.0 30     35 40  60 35.7  7.7 453       0
3 mcs    numeric  6.8 22     29 41  62 31.7 12.8 453       0
4 pcs    numeric 14.1 40     49 57  75 48.0 10.8 453       0
5 pss_fr integer  0.0  3      7 10  14  6.7  4.0 453       0
```
```
helpdat$cesd : CESD at baseline
       n  missing distinct     Info     Mean      Gmd      .05      .10
     453        0       58    0.999    32.85    14.23     10.0     15.2
     .25      .50      .75      .90      .95
    25.0     34.0     41.0     49.0     52.4

lowest :  1  3  4  5  6,  highest: 55 56 57 58 60
```
- `Info` measures the variable's information, between 0 and 1: the higher the Info, the more continuous the variable is (the fewer ties there are).
- `Gmd` = Gini's mean difference, a robust measure of variation. If you randomly selected two of the 453 subjects many times, the mean difference in `cesd` would be 14.23 points.

```
tibble [453 × 8] (S3: tbl_df/tbl/data.frame)
$ id    : int [1:453] 1 2 3 4 5 6 7 8 9 10 ...
 ..- attr(*, "label")= chr "subject ID"
$ cesd  : int [1:453] 49 30 39 15 39 6 52 32 50 46 ...
 ..- attr(*, "label")= chr "CESD at baseline"
$ age   : int [1:453] 37 37 26 39 32 47 49 28 50 39 ...
 ..- attr(*, "label")= chr "age (years)"
$ sex   : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 ..- attr(*, "label")= chr "sex"
$ subst : Factor w/ 3 levels "alcohol","cocaine",..: 2 1 3 3 2 2 2 1 1 3 ...
 ..- attr(*, "label")= chr "primary substance of abuse"
$ mcs   : num [1:453] 25.11 26.67 6.76 43.97 21.68 ...
 ..- attr(*, "label")= chr "SF-36 Mental Component Score"
$ pcs   : num [1:453] 58.4 36 74.8 61.9 37.3 ...
 ..- attr(*, "label")= chr "SF-36 Physical Component Score"
$ pss_fr: int [1:453] 0 1 13 11 10 5 1 4 5 0 ...
 ..- attr(*, "label")= chr "perceived social support by friends"
```
Note that we're placing the outcome (`cesd`) last (result on next slide).
Using `ols` to fit a linear regression model

The `ols` function stands for ordinary least squares and comes from the `rms` package, by Frank Harrell and colleagues. Any model fit with `lm` can also be fit with `ols`.
To predict `var_y` using `var_x` from the `my_tibble` data, we would use syntax along the lines of `ols(var_y ~ var_x, data = my_tibble)`. This leaves the following questions:
- What is the `datadist` stuff doing?
- Why do we include `x = TRUE, y = TRUE` in the fit?

What is `datadist`? Before we fit any `ols` model to data from `my_tibble`, we'll use:
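Concretely, the standard `rms` setup looks like the sketch below; `my_tibble` stands in for your data, and the name `d` is just a convention.

```r
library(rms)

d <- datadist(my_tibble)   # store distribution summaries for every variable
options(datadist = "d")    # tell rms fitting/plotting functions where to find them
```

Run this once per data set, before fitting any models with `ols`.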
Run (the datadist code above) once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables.
(That passage is from the `datadist` documentation.)

Why `x = TRUE, y = TRUE` in the fit?

Once we've set up the distribution summaries with the `datadist` code, we fit linear regression models using the same fitting routines as `lm`, but with `ols`:
- `ols` stores additional information beyond what `lm` does.
- `x = TRUE` and `y = TRUE` save even more expanded information that we'll need in building plots and summaries of the fit.
- The defaults are `x = FALSE, y = FALSE`, but in 432, we'll always want these to be saved.

Using `ols` to fit a Two-Predictor Model

Now, we'll fit an `ols` model predicting our outcome (`cesd`) using two predictors (`mcs` and `subst`) from the `helpdat` tibble.
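The fitting call itself was lost in extraction; a sketch consistent with the output discussed below would be:

```r
d <- datadist(helpdat); options(datadist = "d")  # set up once, if not already done

mod1 <- ols(cesd ~ mcs + subst, data = helpdat,
            x = TRUE, y = TRUE)   # save design matrix and outcome for later tools
mod1
```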
- What is `datadist` doing here?
- Why include `x = TRUE, y = TRUE`?
- What's in `mod1`?
What is `g` in this `ols` output? For our `mod1`,
`LR chi2 = 295.10, d.f. = 3, Pr(> chi2) = 0.0000`

The log of the likelihood ratio, multiplied by -2, yields a test against a \(\chi^2\) distribution. Interpret this as a goodness-of-fit test that compares `mod1` to a null model with only an intercept term. In `ols`, this is similar to a global (ANOVA) F test.
- For `mod1`, `g = 9.827`: if we randomly selected two subjects many times, the mean difference in their predicted `cesd` values would be 9.827 points.
- Compare this to Gini's mean difference for the observed `cesd` values, from `describe`, which was `Gmd = 14.23`.

Validating the `ols` fit

| – | index.orig | training | test | optimism | index.corrected | n |
|---|---|---|---|---|---|---|
| \(R^2\) | 0.4787 | 0.4874 | 0.4737 | 0.0137 | 0.4650 | 40 |
- The `index.orig` for \(R^2\) is 0.4787. That's what we get from the data we used to fit the model, and is what we see in our standard output.
- With `validate`, we create 40 (by default) bootstrapped resamples of the data, and then split each of those into training and test samples.
- We use the `training` sample to obtain \(R^2\): the mean across the 40 splits is 0.4874.
- Then we apply the model to the `test` sample: the average \(R^2\) there was 0.4737.
- `optimism` = training result - test result = 0.0137.
- `index.corrected` = index.orig - optimism = 0.4650.

While our nominal \(R^2\) is 0.4787 for `mod1`, correcting for optimism yields a validated \(R^2\) of 0.4650.
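The validation table above comes from a call along these lines; the seed value is an assumption, mirroring the `set.seed(432)` mentioned later in these notes.

```r
set.seed(432)    # make the bootstrap resamples reproducible (assumed seed)
validate(mod1)   # default: B = 40 bootstrap resamples
```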
This gives us a more honest estimate of how `mod1` will perform in new data.

ANOVA for the `mod1` fit by `ols`

This works much like `anova` after `lm`. As with `lm`, this is a sequential ANOVA table, so if we had included `subst` in the model first, we'd get different SS, MS, F and p results for `mcs` and `subst`, but the same REGRESSION and ERROR results.

`summary` for the `mod1` fit by `ols`

What do the `subst` effects estimated by this model mean?
- The effect of `subst` being cocaine instead of alcohol on `cesd` is -3.44, assuming no change in `mcs`, with 90% CI (-5.10, -1.79).
- The effect of `subst` being heroin instead of alcohol on `cesd` is -1.78, assuming no change in `mcs`, with 90% CI (-3.54, -0.02).

But what about the `mcs` effect?
`summary` for the `mod1` fit by `ols`, continued

- For `mcs`: -12.66 is the estimated change in `cesd` associated with a move from `mcs` = 21.68 (the Low value) to `mcs` = 40.94 (the High value), assuming no change in `subst`.
- By default, `ols` chooses the Low and High values from the interquartile range.
- The plot shows the effect of `mcs` on `cesd` holding `subst` at its baseline level (alcohol), and the effects of `subst` on `cesd` holding `mcs` at its median value, which is 28.602417.

Nomogram of the `ols` fit

For complex models (this model isn't actually very complex) it can be helpful to have a tool that helps you see the modeled effects in terms of their impact on the predicted outcome.
A nomogram is an established graphical tool for doing this.
Adding up the points across predictors and reading down to the bottom scale gives the predicted `cesd` for this subject.

Making predictions with the `mod1` fit

What is the predicted `cesd` for a subject with `mcs` = 35 and `subst` = heroin?
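A sketch of answering that question with `predict` (the exact call used in class isn't shown here):

```r
predict(mod1, newdata = data.frame(mcs = 35, subst = "heroin"))
```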
The `predict` function for our `ols` fit provides fitted values. The `broom` package can also support `rms` fits.

Is `mod1` well-calibrated?

We would like our model to be well-calibrated, in the following sense…
We’d like to look at the relationship between the observed cesd outcome and our predicted cesd from the model.
This is one reason we included `x = TRUE, y = TRUE` in the `ols` fit.

Calibration of `mod1`
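A sketch of the calibration check with `calibrate` from `rms` (the seed is an assumption):

```r
set.seed(432)            # for reproducible bootstrap resampling
plot(calibrate(mod1))    # bootstrap overfitting-corrected calibration curve
```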
```
n=453   Mean absolute error=0.128   Mean squared error=0.02428
0.9 Quantile of absolute error=0.267
```
Which observations influence `mod1`?

The `dfbeta` value for a particular subject and coefficient \(\beta\) is the change in the coefficient that happens when the subject is excluded from the model.
```
$Intercept
[1] "8"   "351" "405" "433"

$mcs
[1] "351" "402" "450"

$subst
[1] "351"
```
`which.influence` flags observations with `dfbetas` that exceed the specified cutoff (the default is 0.2, but that's an arbitrary choice).

```r
w <- which.influence(mod1, cutoff = 0.2)
d <- helpdat |> select(mcs, subst, cesd) |> data.frame()
show.influence(w, d)
```

```
    Count       mcs   subst
8       1   9.16053 alcohol
351     3 *57.48944 *heroin
402     1 *55.47938 alcohol
405     1  15.07887 alcohol
433     1  18.59431 alcohol
450     1 *62.17550 alcohol
```
Use `helpdat |> slice(351)` to see row 351 in its entirety. We could also use the `lm` version of this fit to check Cook's distances.

Residual plots for `mod1`

To build more complete residual plots (and other things) we can fit the `lm` version of this same model…
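A sketch of that `lm` refit and its four standard residual plots; the formula matches `mod1` as described above.

```r
mod1_lm <- lm(cesd ~ mcs + subst, data = helpdat)

par(mfrow = c(2, 2))
plot(mod1_lm)   # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
```

The residuals-vs-leverage panel displays Cook's distance contours, which is one way to follow up on the influential rows flagged above.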
Non-linear terms

In building a linear regression model, we're most often going to be thinking about:
A polynomial in the variable x of degree D is a linear combination of the powers of x up to D. Fitting such a model creates a polynomial regression.
An orthogonal polynomial sets up a model design matrix and then scales those columns so that each column is uncorrelated with the previous ones.
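A quick base-R illustration of that orthogonality, using `poly()`:

```r
x <- 1:10
P <- poly(x, degree = 3)    # orthogonal polynomial basis of degree 3
round(crossprod(P), 10)     # identity matrix: the columns are orthonormal
```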
Should we use a polynomial in `pcs` to predict `cesd`?

Let's look at the association of `pcs` and `cesd`, and fit linear, quadratic, and cubic models for `cesd` using `pcs` with `ols`. Note that we use `pol()` from the `rms` package here to fit orthogonal polynomials, rather than `poly()`, which we used for an `lm` fit.
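The fitting code was garbled in extraction; given the model names `mod_B1` through `mod_B3` used below, a plausible reconstruction is:

```r
mod_B1 <- ols(cesd ~ pcs,         data = helpdat, x = TRUE, y = TRUE)  # linear
mod_B2 <- ols(cesd ~ pol(pcs, 2), data = helpdat, x = TRUE, y = TRUE)  # quadratic
mod_B3 <- ols(cesd ~ pol(pcs, 3), data = helpdat, x = TRUE, y = TRUE)  # cubic
```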
First, we need to store the values. Again, `broom` doesn't play well with `ols` fits, so I'll just add the predictions as columns.
```r
ggplot(cesd_fits, aes(x = pcs, y = cesd)) +
  geom_point() +
  geom_line(aes(x = pcs, y = fitB1),
            col = "blue", size = 1.25) +
  geom_line(aes(x = pcs, y = fitB2),
            col = "black", size = 1.25) +
  geom_line(aes(x = pcs, y = fitB3),
            col = "red", size = 1.25) +
  geom_text(x = 18, y = 47, label = "Linear Fit",
            size = 5, col = "blue") +
  geom_text(x = 18, y = 39, label = "Quadratic Fit",
            size = 5, col = "black") +
  geom_text(x = 18, y = 26, label = "Cubic Fit",
            size = 5, col = "red") +
  labs(title = "Linear, Quadratic and Cubic Fits for cesd using pcs")
```

For restricted cubic splines, the most common choices are 3, 4, or 5 knots.
Fitting restricted cubic splines with `ols`

Let's consider restricted cubic spline models for `cesd` based on `pcs`, with:
- 3 knots in `mod_C3`, 4 knots in `mod_C4`, and 5 knots in `mod_C5`.
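Again, the fitting code was lost in extraction; a plausible reconstruction with `rcs()` from `rms`, matching the model names used below, is:

```r
mod_C3 <- ols(cesd ~ rcs(pcs, 3), data = helpdat, x = TRUE, y = TRUE)  # 3 knots
mod_C4 <- ols(cesd ~ rcs(pcs, 4), data = helpdat, x = TRUE, y = TRUE)  # 4 knots
mod_C5 <- ols(cesd ~ rcs(pcs, 5), data = helpdat, x = TRUE, y = TRUE)  # 5 knots
```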
```r
p1 <- ggplot(Predict(mod_B1)) + ggtitle("B1: Linear")
p2 <- ggplot(Predict(mod_B2)) + ggtitle("B2: Quadratic")
p3 <- ggplot(Predict(mod_B3)) + ggtitle("B3: Cubic")
p4 <- ggplot(Predict(mod_C3)) + ggtitle("C3: 3-knot RCS")
p5 <- ggplot(Predict(mod_C4)) + ggtitle("C4: 4-knot RCS")
p6 <- ggplot(Predict(mod_C5)) + ggtitle("C5: 5-knot RCS")

(p1 + p2 + p3) / (p4 + p5 + p6)
```

Validating these models of the `cesd` to `pcs` association

We run `set.seed(432)` and then `validate(mod_B1)`, etc.

| Model | Index-Corrected \(R^2\) | Corrected MSE |
|---|---|---|
| B1 (linear) | 0.0848 | 143.25 |
| B2 (quadratic) | 0.0752 | 142.49 |
| B3 (cubic) | 0.0909 | 143.73 |
| C3 (3-knot RCS) | 0.0732 | 143.31 |
| C4 (4-knot RCS) | 0.0870 | 144.00 |
| C5 (5-knot RCS) | 0.0984 | 141.44 |
Spearman’s \(\rho^2\) is an indicator (not a perfect one) of potential predictive punch, but doesn’t give away the game.
mcs is the most attractive candidate for a non-linear term, as it packs the most potential predictive punch, so if it does turn out to need non-linear terms, our degrees of freedom will be well spent.
- This doesn't mean `mcs` actually needs a non-linear term, or will show meaningfully better results if a non-linear term is included. We'd have to fit the model with and without non-linearity in `mcs` to know that.
- `pcs`, also quantitative, has the next most potential predictive punch, followed by `pss_fr` and `sex`.
- With 453 observations (452 df), we should be thinking about models with modest numbers of regression inputs.
In this case, we might choose to include non-linear terms in just two or three variables (and that’s it) and even that would be tough to justify with this modest sample size.
```
spear_cesd

Spearman rho^2    Response variable: cesd

        rho2      F df1 df2      P Adjusted rho2   n
mcs    0.438 350.89   1 451 0.0000         0.436 453
subst  0.034   7.97   2 450 0.0004         0.030 453
pcs    0.089  44.22   1 451 0.0000         0.087 453
age    0.000   0.12   1 451 0.7286        -0.002 453
sex    0.033  15.56   1 451 0.0001         0.031 453
pss_fr 0.033  15.57   1 451 0.0001         0.031 453
```
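The `spear_cesd` object above is presumably built with `spearman2` from `rms`; a sketch, with the predictors matching the rows of the output:

```r
spear_cesd <- spearman2(cesd ~ mcs + subst + pcs + age + sex + pss_fr,
                        data = helpdat)
plot(spear_cesd)   # dot chart of adjusted rho^2 by candidate predictor
```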
Fit a model to predict `cesd` using:

- `mcs`
- `pcs`
- `pss_fr`
- `age`
- an interaction of `sex` with the main effect of `mcs` (restricting our model so that terms that are non-linear in both `sex` and `mcs` are excluded), and
- `subst`

Perhaps this is more than we can reasonably do with 453 observations, but let's see how it looks.
Fitting `mod2`

The `%ia%` operator tells R to fit a restricted interaction term with `sex` and the main effect of `mcs`. Note that we must include `sex` as a main effect for the interaction term (`%ia%`) to work here.
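A sketch of a fit matching that description; the choice of purely linear main effects here is an assumption, since the original call may have used non-linear terms in some predictors:

```r
mod2 <- ols(cesd ~ mcs + pcs + pss_fr + age + sex + sex %ia% mcs + subst,
            data = helpdat, x = TRUE, y = TRUE)
```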
What's in `mod2`?

`summary` results for `mod2`

Checking the calibration of `mod2`
```
n=453   Mean absolute error=0.386   Mean squared error=0.19775
0.9 Quantile of absolute error=0.704
```
Using `lm` to fit complex linear models

We can certainly assess this big, complex model using `lm`, too:
But to really delve into the details of how well this complex model works, and to help plot what is actually being fit, we’ll probably want to fit the model using ols.
Some useful results are easily obtained with `lm`, and others are most easily obtained using `ols`.

432 Class 05 | 2024-01-30 | https://thomaselove.github.io/432-2024/